- Network Working Group K. Moore
- Internet Draft University of Tennessee
- 22 March 1993
-
-
- MIME (Multipurpose Internet Mail Extensions) Part Two:
- Message Header Extensions for Non-ASCII Text
-
-
- Status of this Memo
-
- This document is an Internet Draft. Internet Drafts are working
- documents of the Internet Engineering Task Force (IETF), its Areas,
- and its Working Groups. Note that other groups may also distribute
- working documents as Internet Drafts.
-
- Internet Drafts are valid for a maximum of six months and may be
- updated, replaced, or obsoleted by other documents at any time. (The
- file 1id-abstracts.txt, available via anonymous ftp from nic.ddn.mil,
- describes the current status of each Internet Draft.) It is not
- appropriate to use an Internet Draft as reference material or to cite
- one other than as a "work in progress".
-
- This document is a revision of RFC 1342. If approved by the IETF
- message format extensions working group, it will be submitted to the
- IESG as a candidate for Draft Standard status. Distribution of this
- memo is unlimited. Please send comments to
- ietf-822@dimacs.rutgers.edu.
-
-
- Abstract
-
- This memo describes an extension to the message format defined in
- RFC 1341++ [1], to allow the representation of character sets other
- than ASCII in RFC 822 message headers. The extensions described were
- designed to be highly compatible with existing Internet mail handling
- software, and to be easily implemented in mail readers that support
- RFC 1341++.
-
-
- Introduction
-
- RFC 1341++ describes a mechanism for denoting textual body parts
- which are coded in various character sets, as well as methods for
- encoding such body parts as sequences of printable ASCII characters.
- This memo describes similar techniques to allow the encoding of non-
- ASCII text in various portions of a RFC 822 [2] message header, in a
- manner which is unlikely to confuse existing message handling
- software.
-
- Like the encoding techniques described in RFC 1341++, the techniques
- outlined here were designed to allow the use of non-ASCII characters
- in message headers in a way which is unlikely to be disturbed by the
- quirks of existing Internet mail handling programs. In particular,
- some mail relaying programs are known to (a) delete some message
-
-
-
- K. Moore [Page 1]
- Internet Draft Expires 22 September 1993 22 March 1993
-
-
-
- header fields while retaining others, (b) rearrange the order of
- addresses in To or Cc fields, (c) rearrange the (vertical) order of
- header fields, and/or (d) "wrap" message headers at different places
- than those in the original message. In addition, some mail reading
- programs are known to have difficulty correctly parsing message
- headers which, while legal according to RFC 822, make use of
- backslash-quoting to "hide" special characters such as "<", ",", or
- ":", or which exploit other infrequently-used features of that
- specification.
-
- While it is unfortunate that these programs do not correctly
- interpret RFC 822 headers, to "break" these programs would cause
- severe operational problems for the Internet mail system. The
- extensions described in this memo therefore do not rely on little-
- used features of RFC 822. Instead, certain sequences of "ordinary"
- printable ASCII characters (which are assumed to be unlikely to
- otherwise appear in message headers) are reserved for use as encoded
- data. The characters used in these encodings are restricted to those
- which do not have special meanings in the context in which the
- encoded text appears.
-
-
- Notes
-
- This memo relies heavily on notation and terms defined in RFC 822 and
- RFC 1341++. In particular, the syntax for the ABNF used in this memo
- is defined in RFC 822, as well as many of the terms used in the
- grammar for the header extensions defined here. Successful
- implementation of this protocol extension requires careful attention
- to the details of both RFC 822 and RFC 1341++.
-
- When the term "ASCII" appears in this memo, it refers to the "7-Bit
- American Standard Code for Information Interchange", ANSI X3.4-1986.
- The MIME charset name for this character set is "US-ASCII". When not
- specifically referring to the MIME charset name, this document uses
- the term "ASCII", both for brevity and for consistency with RFC 822.
- However, implementors are warned that the character set name must be
- spelled "US-ASCII" in MIME message and body part headers.
-
-
- Encodings
-
- An "encoded-word" is a sequence of printable ASCII characters that
- begins with "=?", ends with "?=", and has two "?"s in between. It
- specifies a character set and an encoding method, and also includes
- the original text encoded as ASCII characters, according to the rules
- for that encoding method.
-
- A mail composer that implements this specification will provide a
- means of inputting non-ASCII text in header fields, but will
- translate these fields (or appropriate portions of these fields) into
- encoded-words before inserting them into the message header.
-
- A mail reader that implements this specification will recognize
- encoded-words when they appear in certain portions of the message
- header. Instead of displaying the encoded-word "as is", it will
- reverse the encoding and display the original text in the designated
- character set.
-
- An "encoded-word" is more precisely defined by the following ABNF
- grammar, using the notation of RFC 822:
-
- encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="
-
- charset = token ; legal charsets defined by RFC 1341++
-
- encoding = token ; Either "B" or "Q"
-
- token = 1*<Any CHAR except SPACE, CTLs, and tspecials>
-
- tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
- <"> / "/" / "[" / "]" / "?" / "." / "="
-
- encoded-text = 1*<Any printable ASCII character other than "?" or SPACE>
- ; (but see "Use of encoded-words in message
- ; headers", below)
-
- Both "encoding" and "charset" names are case-independent. Thus
- "ISO-8859-1" is equivalent to "iso-8859-1", and the "Q" encoding may
- be spelled either "Q" or "q".
-
- An encoded-word may not be more than 75 characters long, including
- charset, encoding, encoded-text, and delimiters. If it is desirable
- to encode more text than will fit in an encoded-word of 75
- characters, multiple encoded-words (separated by SPACE or newline)
- may be used. While there is no limit to the length of a multiple-
- line header field, each line of a header field that contains one or
- more encoded-words is limited to 76 characters. NOTE: These
- restrictions are included not only to ease interoperability through
- internetwork mail gateways, but also to impose a limit on the amount
- of lookahead a header parser must employ (while looking for a final
- ?= delimiter) before it can decide whether a token is an encoded-word
- or something else.
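-
- The grammar and the 75-character limit above can be checked
- mechanically. The following is a minimal sketch in Python, not a
- reference implementation; the names parse_encoded_word and MAX_LEN
- are illustrative assumptions and do not come from this memo.
-
```python
import re

# tspecials from the grammar above; a "token" excludes these, SPACE and CTLs.
TSPECIALS = '()<>@,;:\\"/[]?.='
TOKEN_CHARS = "".join(
    chr(c) for c in range(0x21, 0x7F) if chr(c) not in TSPECIALS
)
TOKEN = "[" + re.escape(TOKEN_CHARS) + "]+"
# encoded-text: any printable ASCII character other than "?" or SPACE.
ENCODED_TEXT = r"[\x21-\x3e\x40-\x7e]+"
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>" + TOKEN + r")"
    r"\?(?P<encoding>" + TOKEN + r")"
    r"\?(?P<text>" + ENCODED_TEXT + r")\?="
)
MAX_LEN = 75  # whole encoded-word, delimiters included


def parse_encoded_word(word):
    """Return (charset, encoding, encoded_text), or None if `word`
    is not a well-formed encoded-word."""
    if len(word) > MAX_LEN:
        return None
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        return None
    encoding = m.group("encoding").upper()  # names are case-independent
    if encoding not in ("B", "Q"):
        return None
    return m.group("charset").lower(), encoding, m.group("text")
```
-
- The charset is lower-cased and the encoding upper-cased only so that
- callers can compare them directly; either spelling is equally valid
- on the wire.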
-
- Initially, the legal values for "encoding" are "Q" and "B". These
- encodings are described below. The "Q" encoding is recommended for
- use with Latin character sets, and the "B" encoding for all others.
- Nevertheless, a mail reader which claims to recognize encoded-words
- MUST be able to accept either encoding for any character set which it
- supports.
-
- Only a subset of the printable ASCII characters may be used in
- encoded-text. The SPACE character is not allowed, so that the
- beginning and end of an encoded-word are obvious. The "?" character
- is used within an encoded-word to separate the various portions of
- the encoded-word from one another, and thus cannot appear in the
- encoded-text portion. Other characters are also illegal in certain
- contexts. For example, an encoded-word in a "phrase" preceding an
- address in a From header field may not contain any of the "specials"
- defined in RFC 822. Finally, certain other characters are disallowed
- in some contexts, to ensure reliability for messages that pass
- through internetwork mail gateways.
-
- The "B" encoding automatically meets these requirements. The "Q"
- encoding allows a wide range of printable characters to be used in
- non-critical locations in the message header (e.g., Subject), with
- fewer characters available for use in other locations.
-
-
- The "B" encoding
-
- The "B" encoding is identical to the "BASE64" encoding defined by
- RFC 1341++.
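-
- As an illustration, a "B" encoded-word can be built and decoded with
- a stock base64 routine. This is a sketch only; b_encode and b_decode
- are hypothetical helper names, and the 75-character limit is not
- enforced here.
-
```python
import base64


def b_encode(text, charset):
    """Wrap `text` in a single "B" encoded-word for `charset`."""
    raw = text.encode(charset)
    payload = base64.b64encode(raw).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)


def b_decode(encoded_text, charset):
    """Decode the encoded-text portion of a "B" encoded-word."""
    return base64.b64decode(encoded_text).decode(charset)
```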
-
-
- The "Q" encoding
-
- The "Q" encoding is similar to the "Quoted-Printable" content-
- transfer-encoding defined in RFC 1341++. It is designed to allow
- text containing mostly ASCII characters to be decipherable on an
- ASCII terminal without decoding.
-
- 1. Any 8-bit value may be represented by a "=" followed by two
- hexadecimal digits. For example, if the character set in use
- were ISO-8859-1, the "=" character would thus be encoded as
- "=3D", and a SPACE by "=20".
-
- 2. The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
- represented as "_" (underscore, ASCII 95.). (This character may
- not pass through some internetwork mail gateways, but its use
- will greatly enhance readability of "Q" encoded data with mail
- readers that do not support this encoding.) Note that the "_"
- always represents hexadecimal 20, even if the SPACE character
- occupies a different code position in the character set in use.
-
- 3. 8-bit values which correspond to printable ASCII characters other
- than "=", "?", "_" (underscore), and SPACE may be represented as
- those characters. (But see "Use of encoded-words in message
- headers", below).
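-
- The three rules above can be sketched as a pair of Python functions
- that work on raw 8-bit values. The names q_encode and q_decode are
- assumptions made for this illustration, and the stricter character
- restrictions that apply in some header contexts are not handled.
-
```python
def q_encode(raw: bytes) -> str:
    """Apply the "Q" rules to a sequence of 8-bit values."""
    out = []
    for b in raw:
        ch = chr(b)
        if b == 0x20:
            out.append("_")              # rule 2: hexadecimal 20 as "_"
        elif 0x21 <= b <= 0x7E and ch not in "=?_":
            out.append(ch)               # rule 3: printable ASCII as itself
        else:
            out.append("=%02X" % b)      # rule 1: "=" plus two hex digits
    return "".join(out)


def q_decode(text: str) -> bytes:
    """Reverse q_encode; raises ValueError on a truncated "=" escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "_":
            out.append(0x20)             # "_" always means hexadecimal 20
            i += 1
        elif ch == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(ch))
            i += 1
    return bytes(out)
```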
-
-
- Character sets
-
- In an encoded-word, the character set associated with the unencoded text
- is specified by a charset. A charset can be any of the character set
- names allowed in an RFC 1341++ "charset" parameter of a "text/plain"
- body part, or any character set name registered with IANA for use with
- the MIME text/plain content-type. (See section 7.1.1 of RFC 1341++ for
- a list of charsets defined in that document).
-
- Some character sets use code-switching techniques to switch between
- "ASCII mode" and other modes. Display of each encoded-word using such a
- character set implicitly begins in ASCII mode. If the encoded-text in
- an encoded-word contains control codes to switch out of ASCII mode, it
- must also contain additional control codes such that ASCII mode is again
- selected at the end of the encoded-word. (This rule applies separately
- to each encoded-word, including adjacent encoded-words within a single
- header field.)
-
- When there is a possibility of using more than one character set to
- represent the text in an encoded-word, and in the absence of private
- agreements between sender and recipients of a message, it is recommended
- that members of the ISO-8859-* series be used in preference to other
- character sets. Among the various ISO-8859-* character sets, the
- lowest-numbered set which contains all of the required characters should
- be used.
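-
- The recommendation above amounts to a simple search. The sketch
- below assumes the composer holds the text as a Python (Unicode)
- string and tries parts 1 through 9 of ISO-8859 in order; both the
- function name pick_charset and that range are assumptions of this
- example, not requirements of this memo.
-
```python
def pick_charset(text):
    """Return "US-ASCII" if that suffices, else the lowest-numbered
    ISO-8859-* set that can represent `text`, else None."""
    try:
        text.encode("ascii")
        return "US-ASCII"
    except UnicodeEncodeError:
        pass
    for part in range(1, 10):
        name = "ISO-8859-%d" % part
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue              # some character is missing from this set
        return name
    return None                   # no candidate set covers the text
```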
-
-
- Use of encoded-words in message headers
-
- A sequence of one or more encoded-words is used to represent non-ASCII
- textual data within a header field. An encoded-word must be separated
- from an adjacent encoded-word, "word", "text", "ctext", or "special" by
- a linear white-space character or a newline. When displaying a
- particular header field that contains multiple encoded-words, any
- linear-white-space that separates a pair of adjacent encoded-words is
- ignored. (This is to allow the use of multiple encoded-words to
- represent long strings of unencoded text, without having to separate
- encoded-words where spaces occur in the unencoded text.)
-
- Each encoded-word must represent an integral number of characters; a
- character may not be split across adjacent encoded-words.
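-
- One way a composer might satisfy both the length limit and the
- whole-character rule is to grow each encoded-word one character at a
- time and start a new word when the limit would be exceeded. This is
- a sketch using the "B" encoding; split_into_encoded_words and
- MAX_WORD are hypothetical names.
-
```python
import base64

MAX_WORD = 75  # maximum length of one encoded-word


def _word(chunk, charset):
    """Build one "B" encoded-word from a chunk of whole characters."""
    payload = base64.b64encode(chunk.encode(charset)).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)


def split_into_encoded_words(text, charset):
    """Split `text` into encoded-words of at most MAX_WORD characters,
    never dividing a character between two words."""
    words = []
    chunk = ""
    for ch in text:
        candidate = chunk + ch
        if chunk and len(_word(candidate, charset)) > MAX_WORD:
            words.append(_word(chunk, charset))
            chunk = ch            # the character moves whole to the next word
        else:
            chunk = candidate
    if chunk:
        words.append(_word(chunk, charset))
    return words
```
-
- The resulting words would be joined with SPACE or a newline followed
- by SPACE; a reader ignores that white space when displaying adjacent
- encoded-words.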
-
- An encoded-word may appear in a message header or body part header
- according to the following rules:
-
- - An encoded-word may replace a "text" token (as defined by RFC 822) in:
- (1) a Subject or Comments header field, (2) any extension message
- header field, (3) any user-defined message header field, or (4) any
- RFC 1341++ body part header field (such as Content-Description) for
- which the field body contains only "text"s.
-
- - An encoded-word may appear within a comment delimited by "(" and ")",
- i.e., wherever a "ctext" is allowed. More precisely, the RFC 822 ABNF
- definition for "comment" is amended as follows:
-
- comment = "(" *(ctext / quoted-pair / comment / encoded-word) ")"
-
- A "Q"-encoded encoded-word which appears in a comment MUST NOT contain
- the characters "(", ")" or "\".
-
- An encoded-word may replace a "word" entity within a "phrase", for example,
- one that precedes an address in a From, To, or Cc header. The ABNF
- definition for phrase from RFC 822 thus becomes:
-
- phrase = 1*(encoded-word / word)
-
- In this case the set of characters that may be used in a "Q"-encoded
- encoded-word is restricted to: <upper and lower case ASCII letters,
- decimal digits, "!", "*", "+", "-", "/", "=", and "_" (underscore,
- ASCII 95.)>.
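-
- The per-context character restrictions in the three rules above can
- be expressed as a small validity check on "Q" encoded-text. This is
- an illustrative sketch; the function name q_text_allowed and the
- context labels "text", "comment" and "phrase" are assumptions.
-
```python
import string

# Characters permitted in a "Q" encoded-word that replaces a "word"
# within a "phrase".
PHRASE_SAFE = set(string.ascii_letters + string.digits + "!*+-/=_")
# Characters forbidden in a "Q" encoded-word inside a comment.
COMMENT_FORBIDDEN = set("()\\")


def q_text_allowed(encoded_text, context):
    """Return True if `encoded_text` may appear in the given context."""
    for c in encoded_text:
        # encoded-text is printable ASCII other than "?" or SPACE
        if c == "?" or not (0x21 <= ord(c) <= 0x7E):
            return False
    if context == "phrase":
        return all(c in PHRASE_SAFE for c in encoded_text)
    if context == "comment":
        return not any(c in COMMENT_FORBIDDEN for c in encoded_text)
    return context == "text"
```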
-
- These are the ONLY locations where an encoded-word may appear. In
- particular, an encoded-word MUST NOT appear in any portion of an
- "address". In addition, an encoded-word MUST NOT be used in a
- Received header field.
-
- Whenever such words appear in a header being displayed, an enlightened
- mail reader will decode the text and render it appropriately.
-
- Only textual data (printable and white space characters) should be
- encoded using this scheme. However, since these encoding schemes
- allow the encoding of arbitrary 8-bit values, mail readers that
- implement this decoding should also ensure that display of the decoded
- data on the recipient's terminal will not cause unwanted side-effects.
-
- Use of these methods to encode non-textual data (e.g., pictures or
- sounds) is not defined by this memo. Use of encoded-words to
- represent strings of purely ASCII characters is allowed, but
- discouraged.
-
-
- Recognition of encoded-words in message headers
-
- An encoded-word may be distinguished from an ordinary "word", "text", or
- "ctext", as follows: An encoded-word begins with "=?", ends with "?=",
- contains exactly four "?" characters including the delimiters, and is
- followed by a SPACE or newline. If the "word", "text", or "ctext" does
- not meet the above tests, it should be displayed as it appears in the
- message header.
-
- If the mail reader does not support the character set used, it may
- either display the encoded-word as ordinary text (i.e., as it appears in
- the header), or it may substitute an appropriate message indicating that
- the decoded text could not be displayed.
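-
- A reader's display step might look like the sketch below: recognize
- encoded-words, drop the white space between adjacent ones, and show
- an encoded-word as it appears when its character set is unsupported
- or its contents are malformed. The name display_header and the
- simple tokenizing on white space and parentheses are assumptions of
- this example; header unfolding and full RFC 822 parsing are omitted.
-
```python
import base64
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]+)\?=")


def _decode_word(match):
    """Return the decoded text, or None if it cannot be decoded."""
    charset, encoding, text = match.groups()
    try:
        if encoding.upper() == "B":
            raw = base64.b64decode(text, validate=True)
        else:
            raw = re.sub(
                rb"=([0-9A-Fa-f]{2})",
                lambda m: bytes([int(m.group(1), 16)]),
                text.replace("_", " ").encode("ascii"),
            )
        return raw.decode(charset)
    except (LookupError, ValueError):   # unknown charset or bad data
        return None


def display_header(value):
    """Return `value` with its encoded-words decoded for display."""
    out = []
    prev_encoded = False
    pending_ws = ""
    for tok in re.split(r"(\s+|[()])", value):
        if tok == "":
            continue
        if tok.isspace():
            pending_ws = tok
            continue
        m = ENCODED_WORD.fullmatch(tok) if len(tok) <= 75 else None
        decoded = _decode_word(m) if m else None
        # White space between two adjacent encoded-words is ignored.
        if not (m is not None and prev_encoded):
            out.append(pending_ws)
        out.append(decoded if decoded is not None else tok)
        prev_encoded = m is not None
        pending_ws = ""
    out.append(pending_ws)
    return "".join(out)
```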
-
-
- Conformance
-
- A mail composing program claiming compliance with this specification
- MUST ensure that any string of printable ASCII characters in a "text" or
- "ctext" entity within a header, or any "atom" within a "phrase", that
- begins with "=?" and ends with "?=" be a valid encoded-word.
-
- A mail reading program claiming compliance with this specification must
- be able to distinguish encoded-words from "text", "ctext", or "word"s
- anytime they appear in appropriate places in message headers. In
- addition, the program must be able to display the unencoded text if the
- character set is "US-ASCII". For the ISO-8859-* character sets, the
- mail reading program must at least be able to display the characters
- which are also in the ASCII set.
-
-
- Examples
-
- From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
- To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
- CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
- Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
- =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
-
- From: =?ISO-8859-1?Q?Olle_J=E4rnefors?= <ojarnef@admin.kth.se>
- To: ietf-822@dimacs.rutgers.edu, ojarnef@admin.kth.se
- Subject: Time for ISO 10646?
-
- To: Dave Crocker <dcrocker@mordor.stanford.edu>
- Cc: ietf-822@dimacs.rutgers.edu, paf@comsol.se
- From: =?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?= <paf@nada.kth.se>
- Subject: Re: RFC-HDR care and feeding
-
- From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
- (=?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?=)
- To: Greg Vaudreuil <gvaudre@NRI.Reston.VA.US>, Ned Freed
- <ned@innosoft.com>, Keith Moore <moore@cs.utk.edu>
- Subject: Test of new header generator
- MIME-Version: 1.0
- Content-type: text/plain; charset=ISO-8859-1
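-
- Headers such as those above can also be decoded with an existing
- library. The sketch below uses the email.header module from the
- Python standard library, which implements RFC 2047, a later revision
- of this mechanism; its behaviour may differ in detail from the rules
- in this memo, and the helper name render is an assumption.
-
```python
from email.header import decode_header, make_header


def render(value):
    """Decode every encoded-word in a header value into one string."""
    return str(make_header(decode_header(value)))


# The two-line Subject field from the first example, including the
# folding newline between its two encoded-words.
subject = (
    "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
    " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="
)
print(render(subject))

# decode_header returns (bytes, charset) pairs for encoded-words.
print(decode_header("=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?="))
```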
-
-
-
- References
-
- [1] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet Mail
- Extensions) Part One: Mechanisms for Specifying and Describing the
- Format of Internet Message Bodies", Internet-Draft RFC 1341++,
- Bellcore, Innosoft, March 1993.
-
- [2] Crocker, D., "Standard for the Format of ARPA Internet Text
- Messages", RFC 822, UDEL, August 1982.
-
-
- Security Considerations
-
- Security issues are not discussed in this memo.
-
-
- Author's Address
-
- Keith Moore
- University of Tennessee
- 107 Ayres Hall
- Knoxville TN 37996-1301
-
- EMail: moore@cs.utk.edu
-
-
- Appendix - changes from RFC 1342
-
-
- 1. Title changed to say "MIME Part 2".
-
- 2. Character sets allowed to include IANA-registered charsets in
- addition to those defined in RFC 1341++. (X-* charsets are still
- excluded.)
-
- 3. RFC 1342 said, in effect, "don't display a space or newline
- following an encoded-word". This memo says, "don't display any
- linear-white-space between adjacent encoded-words."
-
- 4. Each encoded-word must now contain an integral number of
- characters.
-
- 5. Added language about charsets that use code-switching techniques.
-
- 6. "Compliance" paragraph changed -- =?something?= is now only
- required to be a valid encoded-word if the =?something?= is either
- contained in a "text", or is an "atom" within a "phrase".
-
- 7. Clarified the 76 character per line limit (it's per line, not per
- header field).
-
- 8. Define "charset" and "encoding" to be case-independent.
-
- 9. Added note to explicitly refer the reader to RFC 822 and
- RFC 1341++.
-
- 10. Added note re: usage of "ASCII" vs. "US-ASCII".